Abstract

The multinational SYSCILIA consortium aims to gain a mechanistic understanding of the cilium. We utilize multiple
parallel high-throughput (HTP) initiatives to develop predictive models of relationships between complex 
genotypes and variable phenotypes of ciliopathies. The models generated are only as good as the wet laboratory 
data fed into them. It is therefore essential to orchestrate a well-annotated and high-confidence dataset to be able 
to assess the quality of any HTP dataset. Here, we present the inaugural SYSCILIA gold standard of known ciliary 
components as a public resource. 



Review 

High-throughput (HTP) experiments and their computational
analyses are becoming increasingly important as
fundamental research tools. However, concerns have
been raised with respect to the quality of the earliest com- 
parative analyses of genomics data [1]. For example, the 
quality of HTP experiments and their bioinformatic ana- 
lyses is typically undocumented and indeed often un- 
known. Quality, sensitivity and accuracy are important 
parameters to consider when deciding how to carry out 
HTP methods, determine cut-off thresholds and object- 
ively evaluate the results. Within the SYSCILIA consor- 
tium, we aim to systematically evaluate the quality of our 
HTP experiments, such as genome-wide siRNA screening, 
as well as to develop powerful bioinformatic and analytical
tools to exploit the large datasets produced by HTP
procedures across multiple centers. Here, we present one 
such tool we have generated, the SYSCILIA gold standard 
(SCGS) of known ciliary genes. 

The SCGS is a standardized list of verified ciliary genes, 
which can be used as a reference dataset of cilia genes for 
quality metric analyses of experiments and analyses
investigating the cilium and its components. This list is
not meant to be comprehensive but rather to be highly 
reliable; we err on the side of caution to ensure that 
the genes in this publicly available list all encode well-characterized
ciliary components. Such a gold standard is
a very powerful tool for the comparison of datasets pro- 
duced by HTP methods, allowing the quantification of the 
quality of our experiments in terms of sensitivity, specifi- 
city and related metrics (for example true positive rate 
and false discovery rate (FDR)). 
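
The metrics named above can be illustrated with a minimal sketch. It assumes a positive reference set (such as the SCGS) and a separate negative reference set of non-ciliary genes; the function name `benchmark` and all gene identifiers are hypothetical placeholders, not SCGS entries.

```python
def benchmark(hits, gold_positives, gold_negatives):
    """Score a set of HTP hits against positive and negative reference sets.

    Genes that appear in neither reference set are ignored, because their
    true status is unknown.
    """
    hits = set(hits)
    tp = len(hits & gold_positives)   # reference ciliary genes recovered
    fn = len(gold_positives - hits)   # reference ciliary genes missed
    fp = len(hits & gold_negatives)   # reference non-ciliary genes called hits
    tn = len(gold_negatives - hits)   # reference non-ciliary genes not called
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "fdr": fp / (tp + fp) if tp + fp else nan,           # false discovery rate
    }


# Hypothetical identifiers, for illustration only.
gold_positives = {"P1", "P2", "P3", "P4"}
gold_negatives = {"N1", "N2", "N3", "N4"}
screen_hits = {"P1", "P2", "P3", "N1", "X9"}  # X9 is in neither reference set

metrics = benchmark(screen_hits, gold_positives, gold_negatives)
print(metrics)
```

Note that the FDR computed this way only reflects hits with a known reference status, so it is an estimate that depends on how representative the two reference sets are.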

Within the field of cilia and ciliopathy research, existing
databases, such as Cildb [2] and Cilia Proteome [3],
are already widely consulted and represent an immense 
asset to ciliary research. This is reflected by the frequency 
of use of these resources by many cilia research groups 
(cited 14 and 140 times, respectively, in Thomson Reuters 
Web of Knowledge, 22 May 2013). However, all studies 
contributing data to these databases are considered 
equally informative despite some studies likely suffering 
from a higher number of false positives than others. Ob- 
jective estimation of the quality or predictive power of 
each dataset would be a valuable addition. Calculating the 
sensitivity and specificity of each dataset will provide an 
objective indicator of whether to include or exclude 
datasets for a particular purpose, or how to weigh their 
contribution in Bayesian data integration. Additionally, 
comparison of datasets to the SCGS can also facilitate determination
of objective cut-off thresholds via receiver operating
characteristic (ROC) curves. With the SCGS, we
deliver a valuable resource to scientists in the wider field 
of cilia biology and anticipate a pivotal role for the SCGS 
in our multi-centre systems biology approach. 
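
As an illustration of how a ROC curve can suggest a cut-off, the sketch below scores each candidate threshold against a positive and a negative reference set. Choosing the threshold that maximizes Youden's J (TPR minus FPR) is an assumption made here for illustration; it is one common criterion, not a method prescribed by the consortium. Function names, scores and gene identifiers are hypothetical.

```python
def roc_points(scores, positives, negatives):
    """Return (threshold, TPR, FPR) tuples, one per distinct score.

    A gene counts as a hit at a given threshold when its score is >= threshold.
    Genes outside both reference sets are ignored.
    """
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g in negatives]
    if not pos or not neg:
        raise ValueError("need at least one scored positive and one scored negative")
    points = []
    for t in sorted(set(pos + neg), reverse=True):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        points.append((t, tpr, fpr))
    return points


def best_threshold(points):
    """Pick the threshold with the largest Youden's J = TPR - FPR."""
    return max(points, key=lambda p: p[1] - p[2])[0]


# Hypothetical screen scores, for illustration only.
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.2, "N1": 0.7, "N2": 0.3, "N3": 0.1}
points = roc_points(scores, positives={"P1", "P2", "P3"}, negatives={"N1", "N2", "N3"})
cutoff = best_threshold(points)
print(cutoff)
```

In this toy example the cut-off of 0.8 recovers two of three reference positives without admitting any reference negative.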

The SYSCILIA gold standard of ciliary genes 

As a statistical tool, the SCGS needs to be a high- 
confidence list of sufficient size, but does not need to be 
comprehensive; the SCGS does not need to contain all 
possible ciliary genes to be effective. In order to obtain the 
most reliable results, the SCGS preferably needs to be free 
of experimental or other biases and contain no incorrectly 
assigned genes. For this reason, inclusion of genes based 
solely on recovery by single HTP experiments or sources 
with similar potentially high FDRs should be avoided; 
while genes extensively characterized as ciliary genes in in- 
dividual 'gene-specific' publications, or multiple publica- 
tions, are highly desirable. Nevertheless, the advantage of 
HTP results is that they offer a comprehensive starting
point for assembly, without the need to, for example,
scan through the whole human genome for cilium genes.
An efficient way of combining detailed expert cilia biology 
knowledge with the comprehensive nature of HTP experi- 
ments is to generate an automatically compiled gene list 
from potentially high quality datasets, curate it manually 
and combine it with expert knowledge for genes that were 
missed in the HTP experiments (Figure 1). 
Figure 1 Flow diagram describing the processes to create
the SCGS.



To compile the SCGS we collected 27 ciliary studies 
[2,4-29] from Cildb [2], which holds the largest collec- 
tion of ciliary datasets (for an overview of the ciliary 
datasets see Additional file 1). Only datasets based on 
experimental methods were considered; datasets based 
on comparative genomics predictions were excluded. 
The remaining studies covered nine eukaryotic species. 
All datasets were mapped to human genes by combining 
two orthology methods, namely OrthoMCL [30] and 
InParanoid [31]. We only considered one-to-one 
orthologues between the species of a given dataset and 
human to avoid cases where after gene duplication, one 
of the daughter proteins no longer plays a role in the 
cilium. We accepted a one-to-one orthologue pair reported
by InParanoid only when both genes were also contained
within the same OrthoMCL orthologous group. If
InParanoid did not report any human orthologues for a 
given gene, then the gene reported by OrthoMCL was 
taken. OrthoMCL performs better in retrieving distant
homologues than InParanoid [32], which is particularly valuable
for datasets from the distantly related species Trypanosoma brucei
and Chlamydomonas reinhardtii. All other genes in the datasets
were excluded, leaving 3,575 genes. The remaining list was then filtered in
two ways: data mapped by orthology to a human gene 
was required to originate from at least two different spe- 
cies and to be shown to be ciliary-related in at least two 
types of experiments (for example in expression data 
and proteomics data). A total of 503 genes remained. 
Finally, a set of 97 medically relevant ciliopathy genes 
was added from Reeuwijk et al. [33]. After removal of
overlapping genes, this resulted in a total of 567 potential
ciliary genes.
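
The orthology rule and the two evidence filters described above can be sketched as follows. This is a simplified reading of the text rather than the consortium's actual pipeline: the lookup tables are hypothetical pre-parsed inputs, the identifiers are placeholders, and accepting the OrthoMCL fallback only when a group holds a single human gene is an assumption made here.

```python
from collections import defaultdict


def map_to_human(gene, inparanoid_1to1, orthomcl_group, orthomcl_humans):
    """Map a model-organism gene to one human gene, or None if unresolved.

    All three lookup tables are hypothetical, pre-parsed inputs:
      inparanoid_1to1: gene -> its one-to-one human orthologue (InParanoid)
      orthomcl_group:  gene (any species, incl. human) -> OrthoMCL group id
      orthomcl_humans: OrthoMCL group id -> list of human genes in the group
    """
    human = inparanoid_1to1.get(gene)
    if human is not None:
        # Keep the InParanoid pair only if OrthoMCL groups both genes together.
        group = orthomcl_group.get(gene)
        if group is not None and group == orthomcl_group.get(human):
            return human
        return None
    # InParanoid reports no human orthologue: fall back to OrthoMCL.
    # Assumption: accept the fallback only when the group holds one human gene.
    humans = orthomcl_humans.get(orthomcl_group.get(gene), [])
    return humans[0] if len(humans) == 1 else None


def filter_candidates(records, min_species=2, min_types=2):
    """Keep human genes supported by >= 2 species and >= 2 experiment types.

    records: iterable of (human_gene, species, experiment_type) tuples.
    """
    species = defaultdict(set)
    types = defaultdict(set)
    for gene, sp, exp in records:
        species[gene].add(sp)
        types[gene].add(exp)
    return {
        g for g in species
        if len(species[g]) >= min_species and len(types[g]) >= min_types
    }


# Hypothetical identifiers, for illustration only.
inparanoid_1to1 = {"cr_1": "H1", "cr_2": "H2"}
orthomcl_group = {"cr_1": "OG1", "H1": "OG1", "cr_2": "OG2", "H2": "OG9",
                  "cr_3": "OG3", "H3": "OG3"}
orthomcl_humans = {"OG1": ["H1"], "OG3": ["H3"], "OG9": ["H2"]}

mapped = {g: map_to_human(g, inparanoid_1to1, orthomcl_group, orthomcl_humans)
          for g in ("cr_1", "cr_2", "cr_3", "cr_4")}

records = [
    ("H1", "Chlamydomonas reinhardtii", "proteomics"),
    ("H1", "Trypanosoma brucei", "expression"),
    ("H2", "Chlamydomonas reinhardtii", "proteomics"),  # one species only
    ("H2", "Chlamydomonas reinhardtii", "expression"),
    ("H3", "Chlamydomonas reinhardtii", "proteomics"),  # one experiment type only
    ("H3", "Trypanosoma brucei", "proteomics"),
]
candidates = filter_candidates(records)
print(mapped, candidates)
```

Only `H1` passes both filters in this toy example, because it is the only gene with evidence from two species and from two experiment types.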

The resulting list of genes was then curated manually. 
Experts within the SYSCILIA consortium annotated genes 
as 'known-ciliary', 'unknown' or 'non-ciliary' based
on literature searches. Additionally, members submitted 
123 known ciliary genes to this list. Genes were con- 
sidered ciliary if evidence was published for ciliary 
localization (including basal body), function in ciliogenesis 
(including cilium-specific transcription) and involvement 
in ciliopathies. The final SCGS contains 303 curated 
ciliary genes. 

We are confident that, by combining experimental 
datasets, a good proportion of the SCGS can be retrieved 
by commonly used experimental methods. By requiring at 
least two types of experimental evidence we limit inclu- 
sion of experimental biases particular for one type of ex- 
periment, like mass spectrometry, which often fails to 
retrieve membrane proteins [34]. We put effort into anno- 
tating the localization of each gene in the SCGS and the 
SCGS covers all the cilium components (Figure 2). These 
annotations can be used to quickly compile subsets based 
on localization. 
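
A minimal sketch of compiling such subsets is given below. It assumes the annotations have been loaded into a mapping from gene to a set of localizations; the gene identifiers and the exact localization labels are hypothetical and may differ from those used in the SCGS file.

```python
from collections import defaultdict


def subsets_by_localization(annotations):
    """Invert a gene -> localizations mapping into localization -> genes.

    A gene annotated with several localizations appears in every
    corresponding subset.
    """
    index = defaultdict(set)
    for gene, localizations in annotations.items():
        for loc in localizations:
            index[loc].add(gene)
    return dict(index)


# Hypothetical annotations, for illustration only.
annotations = {
    "G1": {"basal body"},
    "G2": {"basal body", "axoneme"},
    "G3": {"axoneme"},
}
subsets = subsets_by_localization(annotations)
print(sorted(subsets["basal body"]))
```

Because genes can carry multiple localizations, the subset sizes can sum to more than the number of genes in the list.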
Figure 2 Schematic overview of ciliary components annotated in the gold standard. The schematic depiction of the eukaryotic cilium and
its components as annotated in the SCGS is based on Basten et al. [36]. The pie chart represents the occurrence of ciliary component localization in
the SCGS. The numbers in the individual colors represent the number of individual entries for each location. Note that many genes have been 
ascribed multiple localizations. SCGS, SYSCILIA gold standard. 



Conclusion 

Currently, the SCGS is actively used within our con- 
sortium for purposes ranging from optimization of ex- 
perimental methods, to training and evaluating of 
bioinformatics tools, and as a reference resource. Because 
of its broad use and importance to the cilia community, 
we have made the SCGS publicly available (see Additional 
file 2 and http://www.syscilia.org/goldstandard.shtml).
Our list of known ciliary genes is not exhaustive and we 
expect that the number of newly identified ciliary genes 
will increase greatly over the next two years. The high 
stringency applied to the filtering of datasets has led to a 
small but high-confidence dataset, which we will continue 
to expand and improve on the basis of novel published 
cilium genes. Regular updates of the SCGS can be 
accessed at our consortium website. For many of the met- 
rics discussed above a negative control dataset is also re- 
quired, that is a list of validated non-ciliary genes. We will 
also endeavor to make a negative control dataset available 
in the future. However, it is hard to definitively prove that 
a gene is never cilia-associated and some genes assigned
as negative controls will likely change with new insights.
A negative set is therefore volatile; nevertheless SYSCILIA 
has also recently published such a resource for negative 
protein-protein interactions [35].

We invite everyone to contribute or curate new and 
known ciliary genes, to combine and further our collect- 
ive knowledge on ciliary biology, and use the SCGS to 
enhance research. 

Availability of supporting data 

The SYSCILIA gold standard is provided as an Excel file
in the supplementary material and online at
http://www.syscilia.org/goldstandard.shtml.

Additional files 



Additional file 1: Table of ciliary datasets used to compile the gene 
list, and curate the SCGS and references. 

Additional file 2: Excel spreadsheet of the SYSCILIA gold standard
version 1 (SCGSv1), listing 303 curated genes involved in ciliary
biology and listing potential ciliary genes. 